class: center, middle, inverse, title-slide # Databases ### Kirill & Nicolas ### cynkra GmbH ### March 15, 2022 --- <style type="text/css"> .pull-left { margin-top: -25px; } .pull-right { margin-top: -25px; } .remark-code { font-size: 17px; } .font17 { font-size: 17px; } .font14 { font-size: 14px; } </style> # Introduction Organization of half-day R courses: - Intro courses: * Tidyverse intro I * Base R intro/Tidyverse intro II * Data visualization I * Data visualization II - Advanced courses: * Advanced tidyverse * R package creation * Working with database systems * Parallelization & efficient R programming * Databases (this course) --- # Course material Our course material currently is available from GitHub at https://github.com/cynkra/bag-courses Today we will be looking at the folder `5_databases` --- # Download course material  --- # General remarks - We hope for these courses to be interactive: go ahead and ask if something is unclear! - You can also write into the chat, which I will try to monitor when Kirill is presenting. - We were asked to provide recordings of the courses for those of you who cannot join, so recording is activated. - Per course unit, we offer 4 hours of follow up time; approach us with questions (nicolas@cynkra.com)! --- # Goals for today Playing the whole game! - Extract - Transform - Load - **Consume** --- # Schedule Extract ⇨ Transform ⇨ Load ⇨ Consume .pull-left[ ## One table - Read whole tables - Let the database do the heavy lifting - Basic ETL for one table - Subtle issues to watch out for ] .pull-right[ ## Multiple tables - Joins - The {dm} package - A bit of theory - Playing the whole game ] ## Bonus: Bring your own data + questions --- # Schedule Extract ⇨ Transform ⇨ Load ⇨ Consume .pull-left[ ## One table - **Read whole tables** - Let the database do the heavy lifting - Basic ETL for one table - Subtle issues to watch out for ] .pull-right[ ## Multiple tables - Joins - The {dm} package - A bit of theory - Playing the whole game ] ## Bonus: Bring your own data + questions --- background-image: url("data:image/png;base64,#images/11.webp") background-size: contain background-position: 100% 100% # Read whole tables .pull-left[ - {DBI} and {duckdb} package - Connect - Discover - Read - Query Script: `databases_11.R` ```r library(tidyverse) library(DBI) ``` ] --- background-image: url("data:image/png;base64,#images/11-frame.webp") background-size: contain background-position: 100% 100% # Read whole tables .pull-left[ - {DBI} package - Connect - Discover - Read - Query Script: `databases_11.R` ```r library(tidyverse) library(DBI) ``` ] --- # Connect to the database First step when accessing the database. ```r con_duckdb <- dbConnect(duckdb::duckdb()) con_duckdb ``` ``` <duckdb_connection 0faa0 driver=<duckdb_driver 22170 dbdir=':memory:' read_only=FALSE>> ``` --- background-image: url("data:image/png;base64,#images/magic-hat-and-wand-nath-r.svg") background-size: 25% background-position: 100% 100% # Prepare database Normally done in a preparatory step. ```r dm::copy_dm_to( con_duckdb, dm::dm_pixarfilms(), set_key_constraints = FALSE, temporary = FALSE ) ``` --- # Discover tables Where is my data? ```r dbListTables(con_duckdb) ``` ``` [1] "academy" "box_office" "genres" [4] "pixar_films" "pixar_people" "public_response" ``` ```r dbListFields(con_duckdb, "pixar_films") ``` ``` [1] "number" "film" "release_date" [4] "run_time" "film_rating" ``` Caveat: schemas, catalogs, ... ```sql SELECT * FROM INFORMATION_SCHEMA.TABLES ``` --- # Read tables Read entire tables into your local session, if you can afford it. ```r df_pixar_films <- dbReadTable(con_duckdb, "pixar_films") df_pixar_films ``` ``` number film release_date run_time 1 1 Toy Story 1995-11-22 81 2 2 A Bug's Life 1998-11-25 95 3 3 Toy Story 2 1999-11-24 92 4 4 Monsters, Inc. 2001-11-02 92 5 5 Finding Nemo 2003-05-30 100 6 6 The Incredibles 2004-11-05 115 7 7 Cars 2006-06-09 117 8 8 Ratatouille 2007-06-29 111 9 9 WALL-E 2008-06-27 98 10 10 Up 2009-05-29 96 11 11 Toy Story 3 2010-06-18 103 12 12 Cars 2 2011-06-24 106 13 13 Brave 2012-06-22 93 14 14 Monsters University 2013-06-21 104 15 15 Inside Out 2015-06-19 95 16 16 The Good Dinosaur 2015-11-25 93 17 17 Finding Dory 2016-06-17 97 18 18 Cars 3 2017-06-16 102 19 19 Coco 2017-11-22 105 20 20 Incredibles 2 2018-06-15 118 film_rating 1 G 2 G 3 G 4 G 5 G 6 PG 7 G 8 G 9 G 10 PG 11 G 12 G 13 PG 14 G 15 PG 16 PG 17 PG 18 G 19 PG 20 PG [ reached 'max' / getOption("max.print") -- omitted 7 rows ] ``` --- # Read tables Use `as_tibble()` to convert to a tibble for better display and more robust operation. ```r df_pixar_films <- dbReadTable(con_duckdb, "pixar_films") as_tibble(df_pixar_films) ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># A tibble: 27 × 5</span> <span style='font-weight: bold;'>number</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>release_date</span> <span style='font-weight: bold;'>run_time</span> <span style='font-weight: bold;'>film_rating</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><date></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #BCBCBC;'> 1</span> 1 Toy Story 1995-11-22 81 G <span style='color: #BCBCBC;'> 2</span> 2 A Bug's Life 1998-11-25 95 G <span style='color: #BCBCBC;'> 3</span> 3 Toy Story 2 1999-11-24 92 G <span style='color: #BCBCBC;'> 4</span> 4 Monsters, Inc. 2001-11-02 92 G <span style='color: #BCBCBC;'> 5</span> 5 Finding Nemo 2003-05-30 100 G <span style='color: #BCBCBC;'> 6</span> 6 The Incredibles 2004-11-05 115 PG <span style='color: #BCBCBC;'> 7</span> 7 Cars 2006-06-09 117 G <span style='color: #BCBCBC;'> 8</span> 8 Ratatouille 2007-06-29 111 G <span style='color: #BCBCBC;'> 9</span> 9 WALL-E 2008-06-27 98 G <span style='color: #BCBCBC;'>10</span> 10 Up 2009-05-29 96 PG <span style='color: #949494;'># … with 17 more rows</span> </CODE></PRE> --- # Execute queries Write SQL code to define what data you want to see. ```r dbGetQuery(con_duckdb, "SELECT * FROM pixar_films") ``` ``` number film release_date run_time 1 1 Toy Story 1995-11-22 81 2 2 A Bug's Life 1998-11-25 95 3 3 Toy Story 2 1999-11-24 92 4 4 Monsters, Inc. 2001-11-02 92 5 5 Finding Nemo 2003-05-30 100 6 6 The Incredibles 2004-11-05 115 7 7 Cars 2006-06-09 117 8 8 Ratatouille 2007-06-29 111 9 9 WALL-E 2008-06-27 98 10 10 Up 2009-05-29 96 11 11 Toy Story 3 2010-06-18 103 12 12 Cars 2 2011-06-24 106 13 13 Brave 2012-06-22 93 14 14 Monsters University 2013-06-21 104 15 15 Inside Out 2015-06-19 95 16 16 The Good Dinosaur 2015-11-25 93 17 17 Finding Dory 2016-06-17 97 18 18 Cars 3 2017-06-16 102 19 19 Coco 2017-11-22 105 20 20 Incredibles 2 2018-06-15 118 film_rating 1 G 2 G 3 G 4 G 5 G 6 PG 7 G 8 G 9 G 10 PG 11 G 12 G 13 PG 14 G 15 PG 16 PG 17 PG 18 G 19 PG 20 PG [ reached 'max' / getOption("max.print") -- omitted 7 rows ] ``` --- # Execute queries Write complex SQL code to define what data you want to see. ```r sql <- " SELECT * FROM pixar_films WHERE release_date >= '2020-01-01' " ``` ```r dbGetQuery(con_duckdb, sql) ``` ``` number film release_date run_time film_rating 1 22 Onward 2020-03-06 102 PG 2 23 Soul 2020-12-25 100 PG 3 24 Luca 2021-06-18 151 N/A 4 25 Turning Red 2022-03-11 NA N/A 5 26 Lightyear 2022-06-17 NA N/A 6 27 <NA> 2023-06-16 155 Not Rated ``` --- # Execute queries R 4.0 or later: use new-style string literals for mixing quotes. ```r sql <- r"( SELECT * FROM "pixar_films" WHERE "release_date" >= '2020-01-01' )" ``` ```r dbGetQuery(con_duckdb, sql) ``` ``` number film release_date run_time film_rating 1 22 Onward 2020-03-06 102 PG 2 23 Soul 2020-12-25 100 PG 3 24 Luca 2021-06-18 151 N/A 4 25 Turning Red 2022-03-11 NA N/A 5 26 Lightyear 2022-06-17 NA N/A 6 27 <NA> 2023-06-16 155 Not Rated ``` --- background-image: url("data:image/png;base64,#images/magic-hat-and-wand-nath-r.svg") background-size: 25% background-position: 100% 100% # Further pointers .pull-left[ ## Quoting ```r dbQuoteIdentifier(con_duckdb, "academy") ``` ``` <SQL> "academy" ``` ```r dbQuoteLiteral(con_duckdb, "Toy Story") ``` ``` <SQL> 'Toy Story' ``` ```r dbQuoteLiteral(con_duckdb, as.Date("2020-01-01")) ``` ``` <SQL> '2020-01-01'::date ``` ```r glue::glue_sql(...) ``` ] .pull-right[ ## Parameterized queries ```r sql <- " SELECT count(*) FROM pixar_films WHERE release_date >= ? " dbGetQuery(con_duckdb, sql, params = list(as.Date("2020-01-01")) ) ``` ``` count_star() 1 6 ``` ] --- background-image: url("data:image/png;base64,#images/11-frame.webp") background-size: contain background-position: 100% 100% # Read whole tables: Exercises .pull-left[ 1. List all columns from the `box_office` table. 2. Read the `academy` table. 3. Read all records from the `academy` table that correspond to awards won - Hint: Use the query `"SELECT * FROM academy WHERE status = 'Won'"` 4. Use quoting and/or query parameters to stabilize the previous query. ] --- # Schedule Extract ⇨ Transform ⇨ Load ⇨ Consume .pull-left[ ## One table - Read whole tables - **Let the database do the heavy lifting 1/2** - Basic ETL for one table - Subtle issues to watch out for ] .pull-right[ ## Multiple tables - Joins - The {dm} package - A bit of theory - Playing the whole game ] ## Bonus: Bring your own data + questions --- background-image: url("data:image/png;base64,#images/12.webp") background-size: 40% background-position: 100% 100% # Downsizing on the database .pull-left[ - {dbplyr} package (part of the tidyverse) - Lazy tables - `collect()` - `select()` - `filter()` - `count()` Script: `databases_12_1.R` ```r library(tidyverse) ``` ] --- background-image: url("data:image/png;base64,#images/12-frame.webp") background-size: 40% background-position: 100% 100% # Downsizing on the database .pull-left[ - {dbplyr} package (part of the tidyverse) - Lazy tables - `collect()` - `select()` - `filter()` - `group_by()`, `summarize()`, `ungroup()`, `count()` Script: `databases_12_1.R` ```r library(tidyverse) ``` ] --- # Lazy tables A pointer to a SQL table. The data is still on the database! ```r pixar_films <- tbl(con_duckdb, "pixar_films") pixar_films ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># Source: table<pixar_films> [?? x 5]</span> <span style='color: #949494;'># Database: duckdb_connection</span> <span style='font-weight: bold;'>number</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>release_date</span> <span style='font-weight: bold;'>run_time</span> <span style='font-weight: bold;'>film_rating</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><date></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #BCBCBC;'> 1</span> 1 Toy Story 1995-11-22 81 G <span style='color: #BCBCBC;'> 2</span> 2 A Bug's Life 1998-11-25 95 G <span style='color: #BCBCBC;'> 3</span> 3 Toy Story 2 1999-11-24 92 G <span style='color: #BCBCBC;'> 4</span> 4 Monsters, Inc. 2001-11-02 92 G <span style='color: #BCBCBC;'> 5</span> 5 Finding Nemo 2003-05-30 100 G <span style='color: #BCBCBC;'> 6</span> 6 The Incredibles 2004-11-05 115 PG <span style='color: #BCBCBC;'> 7</span> 7 Cars 2006-06-09 117 G <span style='color: #BCBCBC;'> 8</span> 8 Ratatouille 2007-06-29 111 G <span style='color: #BCBCBC;'> 9</span> 9 WALL-E 2008-06-27 98 G <span style='color: #BCBCBC;'>10</span> 10 Up 2009-05-29 96 PG <span style='color: #949494;'># … with more rows</span> </CODE></PRE> --- # Read the whole table ```r df_pixar_films <- pixar_films %>% collect() df_pixar_films ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># A tibble: 27 × 5</span> <span style='font-weight: bold;'>number</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>release_date</span> <span style='font-weight: bold;'>run_time</span> <span style='font-weight: bold;'>film_rating</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><date></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #BCBCBC;'> 1</span> 1 Toy Story 1995-11-22 81 G <span style='color: #BCBCBC;'> 2</span> 2 A Bug's Life 1998-11-25 95 G <span style='color: #BCBCBC;'> 3</span> 3 Toy Story 2 1999-11-24 92 G <span style='color: #BCBCBC;'> 4</span> 4 Monsters, Inc. 2001-11-02 92 G <span style='color: #BCBCBC;'> 5</span> 5 Finding Nemo 2003-05-30 100 G <span style='color: #BCBCBC;'> 6</span> 6 The Incredibles 2004-11-05 115 PG <span style='color: #BCBCBC;'> 7</span> 7 Cars 2006-06-09 117 G <span style='color: #BCBCBC;'> 8</span> 8 Ratatouille 2007-06-29 111 G <span style='color: #BCBCBC;'> 9</span> 9 WALL-E 2008-06-27 98 G <span style='color: #BCBCBC;'>10</span> 10 Up 2009-05-29 96 PG <span style='color: #949494;'># … with 17 more rows</span> </CODE></PRE> --- # Select columns With `select()`, like with data frames. .pull-left[ ```r pixar_films %>% select(1:3) ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># Source: lazy query [?? x 3]</span> <span style='color: #949494;'># Database: duckdb_connection</span> <span style='font-weight: bold;'>number</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>release_date</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><date></span> <span style='color: #BCBCBC;'> 1</span> 1 Toy Story 1995-11-22 <span style='color: #BCBCBC;'> 2</span> 2 A Bug's Life 1998-11-25 <span style='color: #BCBCBC;'> 3</span> 3 Toy Story 2 1999-11-24 <span style='color: #BCBCBC;'> 4</span> 4 Monsters, Inc. 2001-11-02 <span style='color: #BCBCBC;'> 5</span> 5 Finding Nemo 2003-05-30 <span style='color: #BCBCBC;'> 6</span> 6 The Incredibles 2004-11-05 <span style='color: #BCBCBC;'> 7</span> 7 Cars 2006-06-09 <span style='color: #BCBCBC;'> 8</span> 8 Ratatouille 2007-06-29 <span style='color: #BCBCBC;'> 9</span> 9 WALL-E 2008-06-27 <span style='color: #BCBCBC;'>10</span> 10 Up 2009-05-29 <span style='color: #949494;'># … with more rows</span> </CODE></PRE> ] .pull-right[ ## Under the hood ```r pixar_films %>% select(1:3) %>% show_query() ``` ``` <SQL> SELECT "number", "film", "release_date" FROM "pixar_films" ``` ] --- # Select columns and read .pull-left[ ```r df_pixar_films_3 <- pixar_films %>% select(1:3) %>% collect() df_pixar_films_3 ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># A tibble: 27 × 3</span> <span style='font-weight: bold;'>number</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>release_date</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><date></span> <span style='color: #BCBCBC;'> 1</span> 1 Toy Story 1995-11-22 <span style='color: #BCBCBC;'> 2</span> 2 A Bug's Life 1998-11-25 <span style='color: #BCBCBC;'> 3</span> 3 Toy Story 2 1999-11-24 <span style='color: #BCBCBC;'> 4</span> 4 Monsters, Inc. 2001-11-02 <span style='color: #BCBCBC;'> 5</span> 5 Finding Nemo 2003-05-30 <span style='color: #BCBCBC;'> 6</span> 6 The Incredibles 2004-11-05 <span style='color: #BCBCBC;'> 7</span> 7 Cars 2006-06-09 <span style='color: #BCBCBC;'> 8</span> 8 Ratatouille 2007-06-29 <span style='color: #BCBCBC;'> 9</span> 9 WALL-E 2008-06-27 <span style='color: #BCBCBC;'>10</span> 10 Up 2009-05-29 <span style='color: #949494;'># … with 17 more rows</span> </CODE></PRE> ] .pull-right[ ## Data on the database not affected ```r pixar_films %>% collect() ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># A tibble: 27 × 5</span> <span style='font-weight: bold;'>number</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>release_date</span> <span style='font-weight: bold;'>run_time</span> <span style='font-weight: bold;'>film_rating</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><date></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #BCBCBC;'> 1</span> 1 Toy Story 1995-11-22 81 G <span style='color: #BCBCBC;'> 2</span> 2 A Bug's Life 1998-11-25 95 G <span style='color: #BCBCBC;'> 3</span> 3 Toy Story 2 1999-11-24 92 G <span style='color: #BCBCBC;'> 4</span> 4 Monsters, Inc. 2001-11-02 92 G <span style='color: #BCBCBC;'> 5</span> 5 Finding Nemo 2003-05-30 100 G <span style='color: #BCBCBC;'> 6</span> 6 The Incredibles 2004-11-05 115 PG <span style='color: #BCBCBC;'> 7</span> 7 Cars 2006-06-09 117 G <span style='color: #BCBCBC;'> 8</span> 8 Ratatouille 2007-06-29 111 G <span style='color: #BCBCBC;'> 9</span> 9 WALL-E 2008-06-27 98 G <span style='color: #BCBCBC;'>10</span> 10 Up 2009-05-29 96 PG <span style='color: #949494;'># … with 17 more rows</span> </CODE></PRE> ] --- # Select rows With `filter()`, like with data frames. .pull-left[ ```r pixar_films %>% filter(release_date >= "2020-01-01") ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># Source: lazy query [?? x 5]</span> <span style='color: #949494;'># Database: duckdb_connection</span> <span style='font-weight: bold;'>number</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>release_date</span> <span style='font-weight: bold;'>run_time</span> <span style='font-weight: bold;'>film_rating</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><date></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #BCBCBC;'>1</span> 22 Onward 2020-03-06 102 PG <span style='color: #BCBCBC;'>2</span> 23 Soul 2020-12-25 100 PG <span style='color: #BCBCBC;'>3</span> 24 Luca 2021-06-18 151 N/A <span style='color: #BCBCBC;'>4</span> 25 Turning Red 2022-03-11 <span style='color: #BB0000;'>NA</span> N/A <span style='color: #BCBCBC;'>5</span> 26 Lightyear 2022-06-17 <span style='color: #BB0000;'>NA</span> N/A <span style='color: #BCBCBC;'>6</span> 27 <span style='color: #BB0000;'>NA</span> 2023-06-16 155 Not Rated </CODE></PRE> ] .pull-right[ ## Under the hood ```r pixar_films %>% filter(release_date >= "2020-01-01") %>% show_query() ``` ``` <SQL> SELECT * FROM "pixar_films" WHERE ("release_date" >= '2020-01-01') ``` ] --- # Select rows and read .pull-left[ ```r df_pixar_films_202x <- pixar_films %>% filter(release_date >= "2020-01-01") %>% collect() df_pixar_films_202x ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># A tibble: 6 × 5</span> <span style='font-weight: bold;'>number</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>release_date</span> <span style='font-weight: bold;'>run_time</span> <span style='font-weight: bold;'>film_rating</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><date></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #BCBCBC;'>1</span> 22 Onward 2020-03-06 102 PG <span style='color: #BCBCBC;'>2</span> 23 Soul 2020-12-25 100 PG <span style='color: #BCBCBC;'>3</span> 24 Luca 2021-06-18 151 N/A <span style='color: #BCBCBC;'>4</span> 25 Turning Red 2022-03-11 <span style='color: #BB0000;'>NA</span> N/A <span style='color: #BCBCBC;'>5</span> 26 Lightyear 2022-06-17 <span style='color: #BB0000;'>NA</span> N/A <span style='color: #BCBCBC;'>6</span> 27 <span style='color: #BB0000;'>NA</span> 2023-06-16 155 Not Rated </CODE></PRE> ] .pull-right[ ## Data on the database not affected ```r pixar_films %>% collect() ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># A tibble: 27 × 5</span> <span style='font-weight: bold;'>number</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>release_date</span> <span style='font-weight: bold;'>run_time</span> <span style='font-weight: bold;'>film_rating</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><date></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #BCBCBC;'> 1</span> 1 Toy Story 1995-11-22 81 G <span style='color: #BCBCBC;'> 2</span> 2 A Bug's Life 1998-11-25 95 G <span style='color: #BCBCBC;'> 3</span> 3 Toy Story 2 1999-11-24 92 G <span style='color: #BCBCBC;'> 4</span> 4 Monsters, Inc. 2001-11-02 92 G <span style='color: #BCBCBC;'> 5</span> 5 Finding Nemo 2003-05-30 100 G <span style='color: #BCBCBC;'> 6</span> 6 The Incredibles 2004-11-05 115 PG <span style='color: #BCBCBC;'> 7</span> 7 Cars 2006-06-09 117 G <span style='color: #BCBCBC;'> 8</span> 8 Ratatouille 2007-06-29 111 G <span style='color: #BCBCBC;'> 9</span> 9 WALL-E 2008-06-27 98 G <span style='color: #BCBCBC;'>10</span> 10 Up 2009-05-29 96 PG <span style='color: #949494;'># … with 17 more rows</span> </CODE></PRE> ] --- # Aggregate With `group_by()` + `summarize()` + `ungroup()`, like with data frames. .pull-left[ ```r pixar_films %>% group_by(film_rating) %>% summarize(n = n()) %>% ungroup() ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># Source: lazy query [?? x 2]</span> <span style='color: #949494;'># Database: duckdb_connection</span> <span style='font-weight: bold;'>film_rating</span> <span style='font-weight: bold;'>n</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #BCBCBC;'>1</span> G 13 <span style='color: #BCBCBC;'>2</span> PG 10 <span style='color: #BCBCBC;'>3</span> N/A 3 <span style='color: #BCBCBC;'>4</span> Not Rated 1 </CODE></PRE> ] .pull-right[ ## Under the hood ```r pixar_films %>% group_by(film_rating) %>% summarize(n = n()) %>% ungroup() %>% show_query() ``` ``` <SQL> SELECT "film_rating", COUNT(*) AS "n" FROM "pixar_films" GROUP BY "film_rating" ``` ] --- # Aggregate With `count()`, like with data frames. .pull-left[ ```r pixar_films %>% # # count(film_rating) ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># Source: lazy query [?? x 2]</span> <span style='color: #949494;'># Database: duckdb_connection</span> <span style='font-weight: bold;'>film_rating</span> <span style='font-weight: bold;'>n</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #BCBCBC;'>1</span> G 13 <span style='color: #BCBCBC;'>2</span> PG 10 <span style='color: #BCBCBC;'>3</span> N/A 3 <span style='color: #BCBCBC;'>4</span> Not Rated 1 </CODE></PRE> ] .pull-right[ ## Under the hood ```r pixar_films %>% # # count(film_rating) %>% show_query() ``` ``` <SQL> SELECT "film_rating", COUNT(*) AS "n" FROM "pixar_films" GROUP BY "film_rating" ``` ] --- # Aggregate and read .pull-left[ ```r df_pixar_films_by_rating <- pixar_films %>% count(film_rating) %>% collect() df_pixar_films_by_rating ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># A tibble: 4 × 2</span> <span style='font-weight: bold;'>film_rating</span> <span style='font-weight: bold;'>n</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #BCBCBC;'>1</span> G 13 <span style='color: #BCBCBC;'>2</span> PG 10 <span style='color: #BCBCBC;'>3</span> N/A 3 <span style='color: #BCBCBC;'>4</span> Not Rated 1 </CODE></PRE> ] .pull-right[ ## Data on the database not affected ```r pixar_films %>% collect() ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># A tibble: 27 × 5</span> <span style='font-weight: bold;'>number</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>release_date</span> <span style='font-weight: bold;'>run_time</span> <span style='font-weight: bold;'>film_rating</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><date></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #BCBCBC;'> 1</span> 1 Toy Story 1995-11-22 81 G <span style='color: #BCBCBC;'> 2</span> 2 A Bug's Life 1998-11-25 95 G <span style='color: #BCBCBC;'> 3</span> 3 Toy Story 2 1999-11-24 92 G <span style='color: #BCBCBC;'> 4</span> 4 Monsters, Inc. 2001-11-02 92 G <span style='color: #BCBCBC;'> 5</span> 5 Finding Nemo 2003-05-30 100 G <span style='color: #BCBCBC;'> 6</span> 6 The Incredibles 2004-11-05 115 PG <span style='color: #BCBCBC;'> 7</span> 7 Cars 2006-06-09 117 G <span style='color: #BCBCBC;'> 8</span> 8 Ratatouille 2007-06-29 111 G <span style='color: #BCBCBC;'> 9</span> 9 WALL-E 2008-06-27 98 G <span style='color: #BCBCBC;'>10</span> 10 Up 2009-05-29 96 PG <span style='color: #949494;'># … with 17 more rows</span> </CODE></PRE> ] --- background-image: url("data:image/png;base64,#images/12-frame.webp") background-size: 40% background-position: 100% 100% # Downsizing on the database: Exercises 1 * Find several ways to select the 3 first columns * What happens if you include the name of a variable multiple times in a `select()` call? * Select all columns that contain underscores (use `contains()`) * Use `all_of()` to select 2 columns of your choice --- background-image: url("data:image/png;base64,#images/12-frame.webp") background-size: 40% background-position: 100% 100% # Downsizing on the database: Exercises 2 .pull-left[ Find all films that 1. Are rated "PG" 2. Had a run time below 95 3. Had a rating of "N/A" or "Not Rated" 4. Were released after and including year 2020 5. Have a missing name (`film` column) or `run_time` 6. Are a first sequel (the name ends with "2") - Hint: Bring the data into the R session before filtering ] --- background-image: url("data:image/png;base64,#images/12-frame.webp") background-size: 40% background-position: 100% 100% # Downsizing on the database: Exercises 3 1. How many films are stored in the table? 2. How many films released after 2005 are stored in the table? 3. What is the total run time of all films? - Hint: Use `summarize(sum(...))`, watch out for the warning 4. What is the total run time of all films, per rating? - Hint: Use `group_by()` --- # Schedule Extract ⇨ Transform ⇨ Load ⇨ Consume .pull-left[ ## One table - Read whole tables - **Let the database do the heavy lifting 2/2** - Basic ETL for one table - Subtle issues to watch out for ] .pull-right[ ## Multiple tables - Joins - The {dm} package - A bit of theory - Playing the whole game ] ## Bonus: Bring your own data + questions --- background-image: url("data:image/png;base64,#images/12_2.webp") background-size: 40% background-position: 100% 100% # Computing on the database .pull-left[ - {dbplyr} package (part of the tidyverse) - `mutate()` - `arrange()` - Grouped `summarize()` / `mutate()` Script: `databases_12_2.R` ```r library(tidyverse) ``` ] --- background-image: url("data:image/png;base64,#images/12_2-frame.webp") background-size: 40% background-position: 100% 100% # Computing on the database .pull-left[ - {dbplyr} package (part of the tidyverse) - `mutate()` - `arrange()` - Grouped `summarize()` / `mutate()` Script: `databases_12_2.R` ```r library(tidyverse) ``` ] --- # Transform With `mutate()`, like with data frames. .pull-left[ ```r pixar_films %>% select(film, release_date) %>% mutate(release_year = year(release_date)) ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># Source: lazy query [?? x 3]</span> <span style='color: #949494;'># Database: duckdb_connection</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>release_date</span> <span style='font-weight: bold;'>release_year</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><date></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #BCBCBC;'> 1</span> Toy Story 1995-11-22 <span style='text-decoration: underline;'>1</span>995 <span style='color: #BCBCBC;'> 2</span> A Bug's Life 1998-11-25 <span style='text-decoration: underline;'>1</span>998 <span style='color: #BCBCBC;'> 3</span> Toy Story 2 1999-11-24 <span style='text-decoration: underline;'>1</span>999 <span style='color: #BCBCBC;'> 4</span> Monsters, Inc. 2001-11-02 <span style='text-decoration: underline;'>2</span>001 <span style='color: #BCBCBC;'> 5</span> Finding Nemo 2003-05-30 <span style='text-decoration: underline;'>2</span>003 <span style='color: #BCBCBC;'> 6</span> The Incredibles 2004-11-05 <span style='text-decoration: underline;'>2</span>004 <span style='color: #BCBCBC;'> 7</span> Cars 2006-06-09 <span style='text-decoration: underline;'>2</span>006 <span style='color: #BCBCBC;'> 8</span> Ratatouille 2007-06-29 <span style='text-decoration: underline;'>2</span>007 <span style='color: #BCBCBC;'> 9</span> WALL-E 2008-06-27 <span style='text-decoration: underline;'>2</span>008 <span style='color: #BCBCBC;'>10</span> Up 2009-05-29 <span style='text-decoration: underline;'>2</span>009 <span style='color: #949494;'># … with more rows</span> </CODE></PRE> ] .pull-right[ ## Under the hood ```r pixar_films %>% select(film, release_date) %>% mutate(release_year = year(release_date)) %>% show_query() ``` ``` <SQL> SELECT "film", "release_date", EXTRACT(year FROM "release_date") AS "release_year" FROM "pixar_films" ``` ] --- # Transform and read .pull-left[ ```r df_pixar_films_with_release_year <- pixar_films %>% select(film, release_date) %>% mutate(release_year = year(release_date)) %>% collect() df_pixar_films_with_release_year ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># A tibble: 27 × 3</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>release_date</span> <span style='font-weight: bold;'>release_year</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><date></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #BCBCBC;'> 1</span> Toy Story 1995-11-22 <span style='text-decoration: underline;'>1</span>995 <span style='color: #BCBCBC;'> 2</span> A Bug's Life 1998-11-25 <span style='text-decoration: underline;'>1</span>998 <span style='color: #BCBCBC;'> 3</span> Toy Story 2 1999-11-24 <span style='text-decoration: underline;'>1</span>999 <span style='color: #BCBCBC;'> 4</span> Monsters, Inc. 2001-11-02 <span style='text-decoration: underline;'>2</span>001 <span style='color: #BCBCBC;'> 5</span> Finding Nemo 2003-05-30 <span style='text-decoration: underline;'>2</span>003 <span style='color: #BCBCBC;'> 6</span> The Incredibles 2004-11-05 <span style='text-decoration: underline;'>2</span>004 <span style='color: #BCBCBC;'> 7</span> Cars 2006-06-09 <span style='text-decoration: underline;'>2</span>006 <span style='color: #BCBCBC;'> 8</span> Ratatouille 2007-06-29 <span style='text-decoration: underline;'>2</span>007 <span style='color: #BCBCBC;'> 9</span> WALL-E 2008-06-27 <span style='text-decoration: underline;'>2</span>008 <span style='color: #BCBCBC;'>10</span> Up 2009-05-29 <span style='text-decoration: underline;'>2</span>009 <span style='color: #949494;'># … with 17 more rows</span> </CODE></PRE> ] .pull-right[ ## Data on the database not affected ```r pixar_films %>% collect() ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># A tibble: 27 × 5</span> <span style='font-weight: bold;'>number</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>release_date</span> <span style='font-weight: bold;'>run_time</span> <span style='font-weight: bold;'>film_rating</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><date></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #BCBCBC;'> 1</span> 1 Toy Story 1995-11-22 81 G <span style='color: #BCBCBC;'> 2</span> 2 A Bug's Life 1998-11-25 95 G <span style='color: #BCBCBC;'> 3</span> 3 Toy Story 2 1999-11-24 92 G <span style='color: #BCBCBC;'> 4</span> 4 Monsters, Inc. 2001-11-02 92 G <span style='color: #BCBCBC;'> 5</span> 5 Finding Nemo 2003-05-30 100 G <span style='color: #BCBCBC;'> 6</span> 6 The Incredibles 2004-11-05 115 PG <span style='color: #BCBCBC;'> 7</span> 7 Cars 2006-06-09 117 G <span style='color: #BCBCBC;'> 8</span> 8 Ratatouille 2007-06-29 111 G <span style='color: #BCBCBC;'> 9</span> 9 WALL-E 2008-06-27 98 G <span style='color: #BCBCBC;'>10</span> 10 Up 2009-05-29 96 PG <span style='color: #949494;'># … with 17 more rows</span> </CODE></PRE> ] --- # Aggregate (complex) With `group_by()` + `summarize()` + `ungroup()`, like with data frames. .pull-left[ ```r pixar_films %>% group_by(film_rating) %>% summarize(mean_run_time = mean(run_time)) %>% ungroup() ``` ``` Warning: Missing values are always removed in SQL. Use `mean(x, na.rm = TRUE)` to silence this warning This warning is displayed only once per session. ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># Source: lazy query [?? x 2]</span> <span style='color: #949494;'># Database: duckdb_connection</span> <span style='font-weight: bold;'>film_rating</span> <span style='font-weight: bold;'>mean_run_time</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #BCBCBC;'>1</span> G 100. <span style='color: #BCBCBC;'>2</span> PG 101. <span style='color: #BCBCBC;'>3</span> N/A 151 <span style='color: #BCBCBC;'>4</span> Not Rated 155 </CODE></PRE> ] .pull-right[ ## Under the hood ```r pixar_films %>% group_by(film_rating) %>% summarize(mean_run_time = mean(run_time)) %>% ungroup() %>% show_query() ``` ``` <SQL> SELECT "film_rating", AVG("run_time") AS "mean_run_time" FROM "pixar_films" GROUP BY "film_rating" ``` ] --- # Transform and aggregate With `count()`, like with data frames. .pull-left[ ```r pixar_films %>% count(release_year = year(release_date)) ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># Source: lazy query [?? x 2]</span> <span style='color: #949494;'># Database: duckdb_connection</span> <span style='font-weight: bold;'>release_year</span> <span style='font-weight: bold;'>n</span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #BCBCBC;'> 1</span> <span style='text-decoration: underline;'>1</span>995 1 <span style='color: #BCBCBC;'> 2</span> <span style='text-decoration: underline;'>1</span>998 1 <span style='color: #BCBCBC;'> 3</span> <span style='text-decoration: underline;'>1</span>999 1 <span style='color: #BCBCBC;'> 4</span> <span style='text-decoration: underline;'>2</span>001 1 <span style='color: #BCBCBC;'> 5</span> <span style='text-decoration: underline;'>2</span>003 1 <span style='color: #BCBCBC;'> 6</span> <span style='text-decoration: underline;'>2</span>004 1 <span style='color: #BCBCBC;'> 7</span> <span style='text-decoration: underline;'>2</span>006 1 <span style='color: #BCBCBC;'> 8</span> <span style='text-decoration: underline;'>2</span>007 1 <span style='color: #BCBCBC;'> 9</span> <span style='text-decoration: underline;'>2</span>008 1 <span style='color: #BCBCBC;'>10</span> <span style='text-decoration: underline;'>2</span>009 1 <span style='color: #949494;'># … with more rows</span> </CODE></PRE> ] .pull-right[ ## Under the hood ```r pixar_films %>% mutate(release_year = year(release_date)) %>% count(release_year) %>% show_query() ``` ``` <SQL> SELECT "release_year", COUNT(*) AS "n" FROM (SELECT "number", "film", "release_date", "run_time", "film_rating", EXTRACT(year FROM "release_date") AS "release_year" FROM "pixar_films") "q01" GROUP BY "release_year" ``` ] --- # In-place aggregate (window functions) With `add_count()`, like with data frames. .pull-left[ ```r pixar_films %>% add_count(release_year = year(release_date)) %>% filter(n > 1) %>% select(film, release_year, n) %>% arrange(release_date) ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># Source: lazy query [?? x 3]</span> <span style='color: #949494;'># Database: duckdb_connection</span> <span style='color: #949494;'># Groups: release_year</span> <span style='color: #949494;'># Ordered by: release_date</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>release_year</span> <span style='font-weight: bold;'>n</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #BCBCBC;'>1</span> Inside Out <span style='text-decoration: underline;'>2</span>015 2 <span style='color: #BCBCBC;'>2</span> The Good Dinosaur <span style='text-decoration: underline;'>2</span>015 2 <span style='color: #BCBCBC;'>3</span> Cars 3 <span style='text-decoration: underline;'>2</span>017 2 <span style='color: #BCBCBC;'>4</span> Coco <span style='text-decoration: underline;'>2</span>017 2 <span style='color: #BCBCBC;'>5</span> Onward <span style='text-decoration: underline;'>2</span>020 2 <span style='color: #BCBCBC;'>6</span> Soul <span style='text-decoration: underline;'>2</span>020 2 <span style='color: #BCBCBC;'>7</span> Turning Red <span style='text-decoration: underline;'>2</span>022 2 <span style='color: #BCBCBC;'>8</span> Lightyear <span style='text-decoration: underline;'>2</span>022 2 </CODE></PRE> ] .pull-right[ ## Under the hood ```r pixar_films %>% add_count(release_year = year(release_date)) %>% filter(n > 1) %>% arrange(release_date) %>% select(film, release_year, n) %>% show_query() ``` ``` <SQL> SELECT "film", "release_year", "n" FROM (SELECT "number", "film", "release_date", "run_time", "film_rating", "release_year", COUNT(*) OVER (PARTITION BY "release_year") AS "n" FROM (SELECT "number", "film", "release_date", "run_time", "film_rating", EXTRACT(year FROM "release_date") AS "release_year" FROM "pixar_films") "q01") "q02" WHERE ("n" > 1.0) ORDER BY "release_date" ``` ] --- background-image: url("data:image/png;base64,#images/12_2-frame.webp") background-size: 40% background-position: 100% 100% # Computing on the database: Exercises .pull-left[ 1. Add new columns `release_year` and `release_month`. 2. Use the new columns to compute the number of months since January 1970 for each film 3. Compute the overall median run time, and the median run time per film rating 4. For each film except the last, compute how many days have passed until the next film. - Hint: Use `lag(..., order_by = ...)` ] .pull-right[ 5. Find the maximum number of days between releases of two G and two PG films. ] --- # Schedule Extract ⇨ Transform ⇨ Load ⇨ Consume .pull-left[ ## One table - Read whole tables - Let the database do the heavy lifting - **Basic ETL for one table** - Subtle issues to watch out for ] .pull-right[ ## Multiple tables - Joins - The {dm} package - A bit of theory - Playing the whole game ] ## Bonus: Bring your own data + questions --- background-image: url("data:image/png;base64,#images/13.webp") background-size: 40% background-position: 100% 100% # Extract, Transform, Load .pull-left[ - Obtain raw data - Prepare database table - Write it to a new database - Consume Script: `databases_13.R` ```r library(tidyverse) ``` ] --- background-image: url("data:image/png;base64,#images/13-frame.webp") background-size: 40% background-position: 100% 100% # Extract, Transform, Load .pull-left[ - Obtain raw data - Prepare database table - Write it to a new database - Consume Script: `databases_13.R` ```r library(tidyverse) ``` ] --- # Extract: Raw data From arbitrary source: CSV, Excel, web API, other database, ... ```r pixar_films_raw <- pixarfilms::pixar_films pixar_films_raw ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># A tibble: 27 × 5</span> <span style='font-weight: bold;'>number</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>release_date</span> <span style='font-weight: bold;'>run_time</span> <span style='font-weight: bold;'>film_rating</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><date></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #BCBCBC;'> 1</span> 1 Toy Story 1995-11-22 81 G <span style='color: #BCBCBC;'> 2</span> 2 A Bug's Life 1998-11-25 95 G <span style='color: #BCBCBC;'> 3</span> 3 Toy Story 2 1999-11-24 92 G <span style='color: #BCBCBC;'> 4</span> 4 Monsters, Inc. 2001-11-02 92 G <span style='color: #BCBCBC;'> 5</span> 5 Finding Nemo 2003-05-30 100 G <span style='color: #BCBCBC;'> 6</span> 6 The Incredibles 2004-11-05 115 PG <span style='color: #BCBCBC;'> 7</span> 7 Cars 2006-06-09 117 G <span style='color: #BCBCBC;'> 8</span> 8 Ratatouille 2007-06-29 111 G <span style='color: #BCBCBC;'> 9</span> 9 WALL-E 2008-06-27 98 G <span style='color: #BCBCBC;'>10</span> 10 Up 2009-05-29 96 PG <span style='color: #949494;'># … with 17 more rows</span> </CODE></PRE> --- background-image: url("data:image/png;base64,#images/magic-hat-and-wand-nath-r.svg") background-size: 25% background-position: 100% 100% # Transform: Derived data - Fix type of `number` column - Extract `franchise` and `sequel` columns ```r pixar_films_clean ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># A tibble: 27 × 7</span> <span style='font-weight: bold;'>number</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>franchise</span> <span style='font-weight: bold;'>sequel</span> <span style='font-weight: bold;'>release_date</span> <span style='font-weight: bold;'>run_time</span> <span style='font-weight: bold;'>film_rating</span> <span style='color: #949494; font-style: italic;'><int></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><int></span> <span style='color: #949494; font-style: italic;'><date></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #BCBCBC;'> 1</span> 1 Toy Story Toy Story 1 1995-11-22 81 G <span style='color: #BCBCBC;'> 2</span> 2 A Bug's Life A Bug's Life <span style='color: #BB0000;'>NA</span> 1998-11-25 95 G <span style='color: #BCBCBC;'> 3</span> 3 Toy Story 2 Toy Story 2 1999-11-24 92 G <span style='color: #BCBCBC;'> 4</span> 4 Monsters, Inc. Monsters, Inc. <span style='color: #BB0000;'>NA</span> 2001-11-02 92 G <span style='color: #BCBCBC;'> 5</span> 5 Finding Nemo Finding Nemo <span style='color: #BB0000;'>NA</span> 2003-05-30 100 G <span style='color: #BCBCBC;'> 6</span> 6 The Incredibles The Incredibles <span style='color: #BB0000;'>NA</span> 2004-11-05 115 PG <span style='color: #BCBCBC;'> 7</span> 7 Cars Cars 1 2006-06-09 117 G <span style='color: #BCBCBC;'> 8</span> 8 Ratatouille Ratatouille <span style='color: #BB0000;'>NA</span> 2007-06-29 111 G <span style='color: #BCBCBC;'> 9</span> 9 WALL-E WALL-E <span style='color: #BB0000;'>NA</span> 2008-06-27 98 G <span style='color: #BCBCBC;'>10</span> 10 Up Up <span style='color: #BB0000;'>NA</span> 2009-05-29 96 PG <span style='color: #949494;'># … with 17 more rows</span> </CODE></PRE> --- # Create target database Depending on workflow, often an existing database is used. ```r db_path <- fs::path_abs("pixar.duckdb") db_path ``` ``` /Users/kirill/git/cynkra/bag/bag-courses/5_databases/pixar.duckdb ``` ```r fs::file_delete(db_path) con <- dbConnect(duckdb::duckdb(dbdir = db_path)) con ``` ``` <duckdb_connection b1490 driver=<duckdb_driver b5870 dbdir='/Users/kirill/git/cynkra/bag/bag-courses/5_databases/pixar.duckdb' read_only=FALSE>> ``` --- # Load: Write table to the database ```r dbWriteTable(con, "pixar_films", pixar_films_clean) dbExecute(con, "CREATE UNIQUE INDEX pixarfilms_pk ON pixar_films (film)") ``` ``` [1] 0 ``` ```r nrow(dbReadTable(con, "pixar_films")) ``` ``` [1] 27 ``` ```r dbDisconnect(con) ``` --- # Consume: share the file, open it .pull-left[ ```r fs::dir_info() %>% arrange(desc(birth_time)) %>% head(2) ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># A tibble: 2 × 18</span> <span style='font-weight: bold;'>path</span> <span style='font-weight: bold;'>type</span> <span style='font-weight: bold;'>size</span> <span style='font-weight: bold;'>permissi…</span> <span style='font-weight: bold;'>modification_time</span> <span style='color: #949494; font-style: italic;'><fs::path></span> <span style='color: #949494; font-style: italic;'><fct></span> <span style='color: #949494; font-style: italic;'><fs::></span> <span style='color: #949494; font-style: italic;'><fs::per></span> <span style='color: #949494; font-style: italic;'><dttm></span> <span style='color: #BCBCBC;'>1</span> …ar.duckdb.wal file 1.84K rw-r--r-- 2022-03-13 <span style='color: #949494;'>16:02:25</span> <span style='color: #BCBCBC;'>2</span> pixar.duckdb file 12K rw-r--r-- 2022-03-13 <span style='color: #949494;'>16:02:25</span> <span style='color: #949494;'># … with 13 more variables: </span><span style='color: #949494; font-weight: bold;'>user</span><span style='color: #949494;'> <chr>, </span><span style='color: #949494; font-weight: bold;'>group</span><span style='color: #949494;'> <chr>,</span> <span style='color: #949494;'># </span><span style='color: #949494; font-weight: bold;'>device_id</span><span style='color: #949494;'> <dbl>, </span><span style='color: #949494; font-weight: bold;'>hard_links</span><span style='color: #949494;'> <dbl>,</span> <span style='color: #949494;'># </span><span style='color: #949494; font-weight: bold;'>special_device_id</span><span style='color: #949494;'> <dbl>, </span><span style='color: #949494; font-weight: bold;'>inode</span><span style='color: #949494;'> <dbl>,</span> <span style='color: #949494;'># </span><span style='color: #949494; font-weight: bold;'>block_size</span><span style='color: #949494;'> <dbl>, </span><span style='color: #949494; font-weight: bold;'>blocks</span><span style='color: #949494;'> <dbl>, </span><span style='color: #949494; font-weight: bold;'>flags</span><span style='color: #949494;'> <int>,</span> <span style='color: #949494;'># </span><span style='color: #949494; font-weight: bold;'>generation</span><span style='color: #949494;'> <dbl>, </span><span style='color: #949494; font-weight: bold;'>access_time</span><span style='color: #949494;'> <dttm>,</span> <span style='color: #949494;'># </span><span style='color: #949494; font-weight: bold;'>change_time</span><span style='color: #949494;'> <dttm>, </span><span style='color: #949494; font-weight: bold;'>birth_time</span><span style='color: #949494;'> <dttm></span> </CODE></PRE> ] .pull-right[ ```r con <- dbConnect(duckdb::duckdb(dbdir = db_path, read_only = TRUE)) pixar_films <- tbl(con, "pixar_films") pixar_films ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># Source: table<pixar_films> [?? x 7]</span> <span style='color: #949494;'># Database: duckdb_connection</span> <span style='font-weight: bold;'>number</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>franc…</span> <span style='font-weight: bold;'>sequel</span> <span style='font-weight: bold;'>release_d…</span> <span style='font-weight: bold;'>run_t…</span> <span style='font-weight: bold;'>film_…</span> <span style='color: #949494; font-style: italic;'><int></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><int></span> <span style='color: #949494; font-style: italic;'><date></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #BCBCBC;'> 1</span> 1 Toy Story Toy S… 1 1995-11-22 81 G <span style='color: #BCBCBC;'> 2</span> 2 A Bug's L… A Bug… <span style='color: #BB0000;'>NA</span> 1998-11-25 95 G <span style='color: #BCBCBC;'> 3</span> 3 Toy Story… Toy S… 2 1999-11-24 92 G <span style='color: #BCBCBC;'> 4</span> 4 Monsters,… Monst… <span style='color: #BB0000;'>NA</span> 2001-11-02 92 G <span style='color: #BCBCBC;'> 5</span> 5 Finding N… Findi… <span style='color: #BB0000;'>NA</span> 2003-05-30 100 G <span style='color: #BCBCBC;'> 6</span> 6 The Incre… The I… <span style='color: #BB0000;'>NA</span> 2004-11-05 115 PG <span style='color: #BCBCBC;'> 7</span> 7 Cars Cars 1 2006-06-09 117 G <span style='color: #BCBCBC;'> 8</span> 8 Ratatouil… Ratat… <span style='color: #BB0000;'>NA</span> 2007-06-29 111 G <span style='color: #BCBCBC;'> 9</span> 9 WALL-E WALL-E <span style='color: #BB0000;'>NA</span> 2008-06-27 98 G <span style='color: #BCBCBC;'>10</span> 10 Up Up <span style='color: #BB0000;'>NA</span> 2009-05-29 96 PG <span style='color: #949494;'># … with more rows</span> </CODE></PRE> ] --- background-image: url("data:image/png;base64,#images/13-frame.webp") background-size: 40% background-position: 100% 100% # Exercises 1. Adapt the ETL workflow to convert the `run_time` column to a duration. - Hint: Use `mutate()` with `hms::hms(minutes = ...)` . 2. Re-run the workflow. --- # Schedule Extract ⇨ Transform ⇨ Load ⇨ Consume .pull-left[ ## One table - Read whole tables - Let the database do the heavy lifting - Basic ETL for one table - **Subtle issues to watch out for** ] .pull-right[ ## Multiple tables - Joins - The {dm} package - A bit of theory - Playing the whole game ] ## Bonus: Bring your own data + questions --- background-image: url("data:image/png;base64,#images/14.webp") background-size: 40% background-position: 100% 100% # Caveats .pull-left[ - Lazy tables vs. data frames/tibbles - Order - Logical/Boolean data type - Imperfect translation - Materialization Script: `databases_14.R` ```r library(tidyverse) ``` ] --- background-image: url("data:image/png;base64,#images/14-frame.webp") background-size: 40% background-position: 100% 100% # Caveats .pull-left[ - Lazy tables vs. data frames/tibbles - Order - Logical/Boolean data type - Imperfect translation - Materialization Script: `databases_14.R` ```r library(tidyverse) ``` ] ---